Analysis of Top Tracks in Spotify

Introduction

This is a basic analysis of the top 100 songs on Spotify for the year 2017. The audio features for each song were extracted using the Spotify Web API and the spotipy Python library. Credit goes to Spotify for calculating the audio feature values. This dataset is publicly available on Kaggle.

We will only look at a few columns that are of interest to us.

Data import and processing

Library imports:

library(dplyr)
library(forcats)
library(ggplot2)
library(knitr)
library(readr)

Data import:

df <- read_csv("spotify-2017.csv", 
    col_types = cols(mode = col_character()))
df <- df %>% mutate(mode = fct_recode(mode, 
                                      "Major" = "1.0",
                                      "Minor" = "0.0"))
kable(head(df))

id	name	artists	danceability	energy	key	loudness	mode	speechiness	acousticness	instrumentalness	liveness	valence	tempo	duration_ms	time_signature
7qiZfU4dY1lWllzX7mPBI	Shape of You	Ed Sheeran	0.825	0.652	1	-3.183	Minor	0.0802	0.581000	0.00e+00	0.0931	0.931	95.977	233713	4
5CtI0qwDJkDQGwXD1H1cL	Despacito - Remix	Luis Fonsi	0.694	0.815	2	-4.328	Major	0.1200	0.229000	0.00e+00	0.0924	0.813	88.931	228827	4
4aWmUDTfIPGksMNLV2rQP	Despacito (Featuring Daddy Yankee)	Luis Fonsi	0.660	0.786	2	-4.757	Major	0.1700	0.209000	0.00e+00	0.1120	0.846	177.833	228200	4
6RUKPb4LETWmmr3iAEQkt	Something Just Like This	The Chainsmokers	0.617	0.635	11	-6.769	Minor	0.0317	0.049800	1.44e-05	0.1640	0.446	103.019	247160	4
3DXncPQOG4VBw3QHh3S81	I’m the One	DJ Khaled	0.609	0.668	7	-4.284	Major	0.0367	0.055200	0.00e+00	0.1670	0.811	80.924	288600	4
7KXjTSCq5nL1LoYtL7XAw	HUMBLE.	Kendrick Lamar	0.904	0.611	1	-6.842	Minor	0.0888	0.000259	2.03e-05	0.0976	0.400	150.020	177000	4

For this analysis, we will focus on mode, tempo, valence and loudness. Below are the details for these columns. For details on the remainder of the columns, see here.

name: Name of the song
artists: Artist(s) of the song
loudness: The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
mode: Mode indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived.
valence: A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
tempo: The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.

Differences in tempo between songs in major and minor keys

First, we want to test if there is a difference in the distribution of tempo between songs in a major key and songs in a minor key. Let’s look at this in a histogram:

ggplot(data = df, mapping = aes(x = tempo)) +
    geom_histogram(aes(fill = mode)) +
    facet_wrap(~ mode)

The distribution in both plots look quite similar, with a large peak around 100 and maybe a smaller peak around 130-150.

We can plot both these distributions as a density plot:

ggplot(data = df, mapping = aes(x = tempo)) +
    geom_density(aes(col = mode))

The two distributions look very similar.

Let’s compute the mean tempo for each of the modes:

df %>% group_by(mode) %>%
    summarize(mean_tempo = mean(tempo))

## # A tibble: 2 x 2
##   mode  mean_tempo
##   <fct>      <dbl>
## 1 Minor       116.
## 2 Major       122.

Test if the difference in mean scores for the sexes is significant or not with the \(t\)-test:

major_data <- (df %>% filter(mode == "Major"))$tempo
minor_data <- (df %>% filter(mode == "Minor"))$tempo
t.test(major_data, minor_data, alternative = "two.sided")

## 
##  Welch Two Sample t-test
## 
## data:  major_data and minor_data
## t = 1.0377, df = 97.475, p-value = 0.302
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.152959 16.446581
## sample estimates:
## mean of x mean of y 
##  121.5741  115.9273

The \(p\)-value for this test is around 0.30, so we wouldn’t reject the null hypothesis in favor of the alternative hypothesis.

Test if the distribution of tempo for songs in major key is significantly different from the distribution of tempo for songs in minor key with the Kolmogorov-Smirnov test:

ks.test(major_data, minor_data, alternative = "two.sided")

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  major_data and minor_data
## D = 0.13136, p-value = 0.7946
## alternative hypothesis: two-sided

The p-value for this test is around 0.80, so we don’t have enough evidence to reject the null hypothesis (i.e. the data we have could have reasonably come from the distribution under the null hypothesis).

Relationship between loudness and valence

Scatterplot of valence vs. loudness:

ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
    geom_point()

Let’s fit a linear model of valence vs. loudness. Expectation: The louder the song, the happier it is. Hence, we expect a positive relationship.

lm(valence ~ loudness, data = df)

## 
## Call:
## lm(formula = valence ~ loudness, data = df)
## 
## Coefficients:
## (Intercept)     loudness  
##     0.79386      0.04897

Get more information on the linear fit with summary:

fit <- lm(valence ~ loudness, data = df)
summary(fit)

## 
## Call:
## lm(formula = valence ~ loudness, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39188 -0.12018 -0.00328  0.14860  0.40816 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.79386    0.06570   12.08  < 2e-16 ***
## loudness     0.04897    0.01108    4.42 2.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1986 on 98 degrees of freedom
## Multiple R-squared:  0.1662, Adjusted R-squared:  0.1577 
## F-statistic: 19.54 on 1 and 98 DF,  p-value: 2.548e-05

From the summary, the correlation between valence and loudness is statistically significant.

Plot the linear fit along with the scatterplot:

ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
    geom_point() +
    geom_smooth(method = "lm")

Modeling valence as a function of loudness and mode

Whether a song is in a major key or a minor key could affect the relationship between valence and loudness. Expectation: ???

ggplot(data = df, mapping = aes(x = loudness, y = valence, col = mode)) +
    geom_point() +
    facet_grid(. ~ mode)

First, let’s fit the additive model:

fit <- lm(valence ~ loudness + mode, data = df)
summary(fit)

## 
## Call:
## lm(formula = valence ~ loudness + mode, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3945 -0.1170 -0.0027  0.1498  0.4119 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.79121    0.06839  11.569  < 2e-16 ***
## loudness     0.04912    0.01118   4.394 2.85e-05 ***
## modeMajor    0.00605    0.04061   0.149    0.882    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1996 on 97 degrees of freedom
## Multiple R-squared:  0.1664, Adjusted R-squared:  0.1492 
## F-statistic: 9.684 on 2 and 97 DF,  p-value: 0.0001464

In this model, it seems like whether a song is in a major or minor key doesn’t make a big difference.

Next, let’s fit the model with interactions:

fit <- lm(valence ~ loudness * mode, data = df)
summary(fit)

## 
## Call:
## lm(formula = valence ~ loudness * mode, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39179 -0.12289 -0.00013  0.14712  0.42106 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.82557    0.09463   8.724 8.16e-14 ***
## loudness            0.05541    0.01638   3.384  0.00104 ** 
## modeMajor          -0.06058    0.13270  -0.457  0.64905    
## loudness:modeMajor -0.01186    0.02248  -0.528  0.59898    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2004 on 96 degrees of freedom
## Multiple R-squared:  0.1688, Adjusted R-squared:  0.1429 
## F-statistic: 6.501 on 3 and 96 DF,  p-value: 0.0004735

We can also draw the linear regression fits with the scatterplot:

ggplot(data = df, mapping = aes(x = loudness, y = valence, col = mode)) + 
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    facet_wrap(~ mode)

We can see a slight change in slope, but they look basically the same. This is more obvious when both are plotted on the same plot:

ggplot(data = df, mapping = aes(x = loudness, y = valence, col = mode)) + 
    geom_point() +
    geom_smooth(method = "lm", se = FALSE)

Optional material

Getting predictions from the model

How can we get predictions from the model? We can use the predict function, with the first argument being the model (i.e. output of lm), and the second argument being the data on which to predict. (The data must be in the exact same form as the data the model was trained on, columns and all.)

This code gives the predictions on the training dataset:

fit <- lm(valence ~ loudness, data = df)
predict(fit, df)

##         1         2         3         4         5         6         7 
## 0.6379884 0.5819175 0.5609092 0.4623810 0.5840722 0.4588062 0.4708529 
##         8         9        10        11        12        13        14 
## 0.5469037 0.5509193 0.3837838 0.4821161 0.4790799 0.5478342 0.5768246 
##        15        16        17        18        19        20        21 
## 0.3631673 0.5874511 0.6047376 0.5554735 0.5946497 0.5830438 0.5531719 
##        22        23        24        25        26        27        28 
## 0.5613010 0.4315787 0.5962658 0.4840259 0.4577289 0.6286351 0.4125783 
##        29        30        31        32        33        34        35 
## 0.5196763 0.5695280 0.5029774 0.4496977 0.6423468 0.6323079 0.4874048 
##        36        37        38        39        40        41        42 
## 0.4151737 0.2367261 0.4709508 0.5198721 0.5085110 0.4847605 0.5946497 
##        43        44        45        46        47        48        49 
## 0.6418081 0.5877449 0.5733966 0.4635563 0.2325636 0.6508186 0.5336818 
##        50        51        52        53        54        55        56 
## 0.5821623 0.6765280 0.3319243 0.5338776 0.5489115 0.5056218 0.6324058 
##        57        58        59        60        61        62        63 
## 0.5656104 0.4548396 0.2845210 0.5841701 0.4626749 0.5228104 0.4973948 
##        64        65        66        67        68        69        70 
## 0.5312332 0.5272177 0.4919101 0.3236972 0.4668373 0.6380864 0.5148282 
##        71        72        73        74        75        76        77 
## 0.5671284 0.6288310 0.4828506 0.4213440 0.4859357 0.6423957 0.3884359 
##        78        79        80        81        82        83        84 
## 0.6217303 0.4815284 0.5428392 0.5881857 0.6433751 0.5726131 0.5057687 
##        85        86        87        88        89        90        91 
## 0.5280012 0.4206584 0.4884332 0.5445042 0.4801572 0.5442104 0.4097380 
##        92        93        94        95        96        97        98 
## 0.5934255 0.5237408 0.5318699 0.3909334 0.5607133 0.5171298 0.4400016 
##        99       100 
## 0.5538085 0.4709998

Let’s plot these predictions on the scatterplot to make sure we got it right:

df$predictions <- predict(fit, df)
ggplot(data = df) +
    geom_point(aes(x = loudness, y = valence)) +
    geom_smooth(aes(x = loudness, y = valence), method = "lm", se = FALSE) +
    geom_point(aes(x = loudness, y = predictions), col = "red")

Here are the predictionrs on a randomly generated set of data:

new_df <- data.frame(loudness = c(-10, -3.6, -3.74, -8, -5.22))
predict(fit, new_df)

##         1         2         3         4         5 
## 0.3041581 0.6175678 0.6107120 0.4020986 0.5382360

More information from the model

If you plot the output of the lm call, you will get a series of plots which give you more information on the fit:

fit <- lm(valence ~ loudness, data = df)
plot(fit)

Confidence intervals

The following code gives the mean tempo for all the songs:

mean(df$tempo)

## [1] 119.2025

We can use the following code to get an \(x\)% confidence interval for the mean tempo:

x <- 0.9
t.test(df$tempo, conf.level = x)

## 
##  One Sample t-test
## 
## data:  df$tempo
## t = 42.644, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 90 percent confidence interval:
##  114.5612 123.8437
## sample estimates:
## mean of x 
##  119.2025

To get confidence intervals for parameters in a linear model, we can use the confint function (if level = x in the confint call, it will give the endpoints of the x% confidence interval):

x <- 0.95
fit <- lm(valence ~ loudness, data = df)
confint(fit, level = x)

##                  2.5 %     97.5 %
## (Intercept) 0.66349030 0.92423126
## loudness    0.02698615 0.07095438

Plotting song labels

It might be informative to have the song names in the scatterplot, especially when we want to identify outliers. The following plot is pretty crowded, so putting all the song names might not be a good idea. In any case, here is how you can do it (play around with the different options in geom_text to see what they do):

ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
    geom_point() + 
    geom_text(aes(label = name, col = mode),
              hjust = 0, nudge_x = 0.1, angle = 45, size = 3) +
    facet_grid(. ~ mode)

It’s a little bit of a mess. We can do slightly better by loading the ggrepel package, and replacing geom_text with geom_text_repel (and some change of options):

library(ggrepel)
ggplot(data = df, mapping = aes(x = loudness, y = valence)) +
    geom_point() + 
    geom_text_repel(aes(label = name, col = mode), size = 3) +
    facet_grid(. ~ mode)

Drawing the linear fit for an additive model

I couldn’t find an easy way to draw the linear fits for an additive model in the facetted scatterplot. (If you find a way, let me know!)

This is a workaround:

# get fit coefficients
fit <- lm(valence ~ loudness + mode, data = df)
coefs <- fit$coefficients

# add coefficients to dataset
df$slope <- coefs[2]
df$intercept <- ifelse(df$mode == "Minor", coefs[1], coefs[1] + coefs[3])

# plot
ggplot(data = df, mapping = aes(x = loudness, y = valence, col = mode)) + 
    geom_point() +
    geom_abline(aes(slope = slope, intercept = intercept, col = mode)) +
    facet_wrap(~ mode)